Network Support Library

Text File | 1992-09-14 | 7KB | 175 lines

This incident report is provided as supporting documentation for a message left on NetWire on 9/9/92. The message describes some problems with the new Token Ring drivers for 3.11, and omissions in the update documentation shipped with the drivers.

The following document refers to several servers by name. NN_server and DEVELOPMENT_server are NW 3.11 servers, both IBM PS/2 Model 95s with 16MB RAM. These servers reside on a Token Ring network along with two other servers running NW 2.2. The Token Ring is monitored by Proteon's network monitoring system, which reports media access control (MAC) layer errors.

Please pardon me if the document rambles a bit, but these reports are used for future problem determination and evaluation. Our support group feels that every idea committed to paper helps.

Larry Rubanka, 73465.643

INCIDENT REPORT:

NN_server had been running NetWare 3.11 with TOKEN.LAN v3.13 for 100+ days of uninterrupted up time. Installing NetWare for SAA v1.2 placed the new TOKEN.LAN v3.16 on NN_server. This new LAN driver was loaded and running for several days. The SAA NLM was not running or loaded. Two similar, probably identical, events occurred within two weeks following the SAA installation. No similar events had occurred before the SAA installation.

SYMPTOMS OBSERVED:

1) A number of Receive Congestion (RC) errors were reported by the Proteon network monitor. These RC errors were on the ring used by four different servers and over 100 nodes.
2) The Token Ring was passing data, and workstations could log into and use other servers.
3) No errors were reported on any servers, including NN_server.
4) SLIST did not show NN_server, yet still showed the other servers.
5) Performance at workstations not using NN_server was not noticeably affected.
6) The SNA gateway was in use and functioning.
7) Workstations attached to and accessing NN_server encountered errors.
8) Packet Receive Buffers on NN_server climbed continuously, and server utilization showed 25%.
   This is high compared to our normal 5-10% level.
9) Another 3.11-based server, DEVELOPMENT_server, was running TOKEN.LAN v3.13 and was functioning properly. Packet Receive Buffers and percent utilization on DEVELOPMENT_server were normal.

SOLUTIONS TRIED:

1) Excluded (disallowed access to the ring) any workstations reporting MAC layer Token Ring errors. No effect.
2) Restarted several workstations that were reporting MAC layer errors. No effect.
3) Eliminated all "Receive Congestion" errors from the network by restarting any workstations reporting errors.
4) Shut down NN_server and restarted it. The problems recurred within ten minutes, after other RC errors.
5a) Unloaded TOKEN.LAN v3.16. Packet Receive Buffer growth stopped. Percent utilization dropped to zero.
5b) Loaded the older TOKEN.LAN v3.13. Server operation resumed without errors. Logged in and out, used SLIST and SNA services. Everything was functioning properly at this point.
5c) Shut down and restarted NN_server to reallocate RAM appropriately (Packet Receive Buffers cannot be deallocated). No problems since.

CONCLUSIONS:

1) The problem seems local to NN_server. DEVELOPMENT_server, similarly configured, showed no problems, nor did the 2.2-based servers.
2) A significant difference between DEVELOPMENT_server and NN_server is the TOKEN.LAN version. These servers also use different disk drivers.
3) Receive Congestion errors are rare on our network. In both cases, the incident was preceded by Receive Congestion errors. Cause or effect? Could the Receive Congestion errors have "triggered" the problem with v3.16 TOKEN.LAN, or were they merely a symptom of the problem? DEVELOPMENT_server, running TOKEN.LAN v3.13, did not have any problems dealing with these errors.
4) In both incidents, the RC errors occurred when a PC experienced an error under QEMM 386. I speculate that QEMM had paused the CPU and that the receive buffer on the TR card was not being serviced.
   The card continues to function, however, and reports the RC errors.
5) Due to time and availability constraints, unloading and reloading TOKEN.LAN v3.16 was not tried. However, TOKEN.LAN was stopped and restarted via the NN_server shut down and restart. Restarting the server, and hence the driver, did not eliminate the problems.
6) TOKEN.LAN v3.16 seems to be the problem.

NOTES:

The SAA installation introduced a new version of NMAGENT.NLM. This is not used by other servers. The SAA installation was not completed due to errors encountered with configuration values.

The documentation for TOKENDMA.LAN describes a condition where the driver pauses execution until beaconing stops. This pause accounts for the queuing of Event Control Blocks (ECBs). I believe that the driver's behavior differs significantly from that which is described in this documentation. After adopting TOKEN.LAN version 3.16, shipped with SAA, I recognized a problem similar to that described in the update docs.

I am making some assumptions to help me work around the omissions in the update documents. I assume the behavior of v3.16 TOKEN is similar to v3.16 TOKENDMA with regard to its reaction to "Beaconing" errors. I am also assuming that the queuing of ECBs in the send queue would drive up the number of Packet Receive Buffers allocated and/or the Permanent Pool memory.

Given these assumptions, I have observed the following behavior of the TOKEN.LAN v3.16 driver. The driver appears to pause when a "RECEIVE CONGESTION ERROR" (RC error) occurs. The driver does not appear to resume when the RC error is cleared. Several minutes (20 or more) after the node(s) reporting RC errors are removed, the server is still unavailable.

While the driver is paused in the presence of RC errors, the Permanent memory pool continues to grow, and so do the Packet Receive Buffers (PRBs). I have not waited for the PRBs to grow up to Max PRB. While the PRBs and Permanent memory grow, the server continues operating.
Only Token Ring activity has paused. MONITOR continues to function, as do other NLMs.

If a connection is specified for clearing via MONITOR while the driver is paused, MONITOR pauses. With MONITOR paused, the server is STILL functioning. If MONITOR is unloaded from the console, the console pauses. I would expect this behavior if the driver were paused. I presume that if the driver resumed, the clear connection would proceed, and the unload would continue.

Clearing the RC errors does not cause the driver to resume, at least not quickly. I waited approximately twenty minutes after eliminating all MAC layer errors, but the driver remained paused.

The RC error is reported when a Token Ring interface's receive buffer is not being serviced and overflows. This is a node error, not a ring error. To the best of my knowledge, this is not "Beaconing." While the RC errors persist, the Token Ring continues to pass traffic. Other 3.11 servers running v3.13 TOKEN.LAN continue normally, as do the 2.2 servers.
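
The assumption stated in the notes, that a paused driver leaves buffers held and so drives up the Packet Receive Buffer count, can be pictured with a toy model. The Python sketch below is illustrative only. The function name, the tick-based timing, and every number in it are hypothetical; none of them come from NetWare, the TOKEN.LAN driver, or the servers in this report.

```python
def simulate_buffers(ticks, paused_from, start_buffers=100,
                     max_buffers=400, arrivals_per_tick=5):
    """Toy model of receive-buffer growth behind a paused LAN driver.

    All parameters are hypothetical illustration values, not NetWare
    defaults. Returns the allocated buffer count at the end of each tick.
    """
    allocated = start_buffers   # buffers the server has allocated so far
    in_use = 0                  # buffers currently held by queued work
    history = []

    for tick in range(ticks):
        paused = tick >= paused_from

        for _ in range(arrivals_per_tick):
            if in_use < allocated:
                in_use += 1            # a free buffer is available
            elif allocated < max_buffers:
                allocated += 1         # none free: allocate one more
                in_use += 1
            # otherwise the pool is at its ceiling and the request is dropped

        if not paused:
            in_use = 0                 # a running driver releases its buffers

        history.append(allocated)

    return history


if __name__ == "__main__":
    history = simulate_buffers(ticks=100, paused_from=10)
    print("allocated before pause:", history[9])
    print("allocated at end:      ", history[-1])
```

In this model the allocated count stays flat while the driver runs, then climbs steadily once the pause begins and buffers are never released. That matches the shape of the symptom described above (continuous growth that stops only when the driver is unloaded), but it says nothing about the actual cause inside the v3.16 driver.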